Abstract - A generalized classification methodology that employs
risk factor fusion, normalization, DKLT-based transformation,
feature selection, and parametric classifier design is developed
to predict the presence or absence of a multifactorial disease.
The validity of the method is demonstrated by applying it to
predict the occurrence of gout in patients.

1.  INTRODUCTION 

The goal of this paper is to develop a data-fusion based pattern
recognition approach to predict the presence or absence of a
multifactorial disease from a set of risk factors thought to be
correlated with the disease. The prediction problem is formulated
as a 2-class classification problem in which the two classes are
disease and no-disease. The approach includes risk factor fusion,
normalization, DKLT-based transformation, feature selection, and
parametric classifier design. To demonstrate the validity of the
approach, the prediction of gout, which is a multifactorial
disease, is considered. The goal is to classify a patient into one
of two classes: gout or non-gout. The approach for gout
classification is summarized in Figure 1.

2.  DATA  FUSION  AND  NORMALIZATION 
The risk factors of a multifactorial disease can be fused by
forming a vector in which each element $v_i$ is a risk factor. If
$s$ is the number of risk factors, let the feature vector
consisting of the fusion of the $s$ risk factors be denoted by
$V = \{v_1, v_2, \ldots, v_s\}$. Each patient, therefore, is
represented by a single feature vector. Let the feature vectors of
patients with the disease and of patients without the disease be
denoted by $V^m = \{x_1, x_2, \ldots, x_s\}$ and
$V^n = \{y_1, y_2, \ldots, y_s\}$, respectively. Additionally, let
the feature vector of a test patient be
$V_t = \{t_1, t_2, \ldots, t_s\}$. In general, the feature vector
will contain features with mixed formats because the risk factors
can be real-valued with widely differing ranges as well as binary.
To facilitate classifier development, the features can be
normalized. For example, real-valued features can be linearly
normalized to take values in the interval [0.1, 0.9], and for the
binary features, zeros and ones can be assigned the values 0.1 and
0.9, respectively. The motivation for this normalization is not
only to accommodate features of mixed formats but also to
initially assign equal weight to all features. That is, although
the risk factors are likely to have varying degrees of correlation
with the disease, no assumptions are made initially about the
influence of these factors on the prediction of the disease.
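
The normalization described above can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions: the function name `normalize_features`, the `binary_cols` argument, and the toy patient values are hypothetical and do not come from the study data.

```python
import numpy as np

def normalize_features(X, binary_cols, lo=0.1, hi=0.9):
    """Map each column of X (patients x risk factors) into [lo, hi]."""
    X = np.asarray(X, dtype=float)
    out = np.empty_like(X)
    for c in range(X.shape[1]):
        col = X[:, c]
        if c in binary_cols:
            # Binary risk factor: 0 -> lo, 1 -> hi.
            out[:, c] = np.where(col > 0.5, hi, lo)
        else:
            cmin, cmax = col.min(), col.max()
            span = cmax - cmin
            if span == 0:
                # Assumption: a constant column is placed mid-interval.
                out[:, c] = (lo + hi) / 2
            else:
                # Linear (min-max) normalization of a real-valued factor.
                out[:, c] = lo + (hi - lo) * (col - cmin) / span
    return out

# Hypothetical toy data: columns are [serum uric acid, gender, age].
X = np.array([[8.1, 1, 62],
              [5.0, 0, 45],
              [6.55, 1, 70]])
Xn = normalize_features(X, binary_cols={1})
```

In this sketch the minimum and maximum are taken from the array passed in. When a separate test set is used, it would be reasonable to take them from the design set only and reuse them for the test vectors, although the text does not specify this detail.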

3.  CLASSIFIER  DEVELOPMENT 
Deciding whether a test patient has or does not have the disease
can be formulated as a hypothesis testing problem in which:

$H_0$: $V_t = V^n$; the patient does not have the disease.

$H_1$: $V_t = V^m$; the patient has the disease.

To facilitate classifier development, the feature vector, which
has both real and binary features, can be transformed so that the
transformed feature vector contains only real-valued features. As
a result, it is much easier to make meaningful assumptions for the
conditional densities of the feature vectors under the two
hypotheses. The discrete Karhunen-Loeve transform (DKLT) is a
suitable transformation because each feature in the transformed
vector is a linear combination of the features in the original
feature vector. Let $\tilde{V}^n$ and $\tilde{V}^m$ be the
transformed feature vectors. That is,

\[ \tilde{V}^n = \Phi V^n \]

\[ \tilde{V}^m = \Phi V^m \]

where $\Phi$ is the generalized DKLT transformation matrix
consisting of the eigenvectors of the covariance matrix of the
mixture of the 2 classes [1]. The transformed features, which are
weighted combinations of the original features, can be rank
ordered in terms of the inter-class separation in order to select
the features that are the most useful for separating the two
classes. If $\tilde{x}_{ij}$ and $\tilde{y}_{ij}$ are the $i$th
components of the $j$th transformed feature vector for the disease
and no-disease classes, respectively, then the inter-class
separation between the transformed features $\tilde{x}_i$ and
$\tilde{y}_i$, $i = 1, 2, \ldots, s$, is given by

\[ \rho(i) = \frac{[\bar{x}_i - \bar{y}_i]^2}{\sum_{j=1}^{J} [\tilde{x}_{ij} - \bar{x}_i]^2 + \sum_{j=1}^{J} [\tilde{y}_{ij} - \bar{y}_i]^2}, \quad i = 1, 2, \ldots, s, \]

where $\bar{x}_i$ and $\bar{y}_i$ are the mean values of
$\tilde{x}_i$ and $\tilde{y}_i$, respectively, and $J$ is the
number of feature vectors in each class.
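
The transformation and ranking steps can be sketched as follows. This is a hedged illustration, not the authors' implementation: the function names are hypothetical, the data are synthetic, and the separation measure is assumed to be a Fisher-type ratio (squared difference of the class means divided by the summed within-class scatter).

```python
import numpy as np

def dklt_matrix(X_disease, X_control):
    """Rows of the result are eigenvectors of the mixture covariance."""
    mixture = np.vstack([X_disease, X_control])
    cov = np.cov(mixture, rowvar=False)      # s x s covariance of pooled data
    _, eigvecs = np.linalg.eigh(cov)         # eigh: for symmetric matrices
    return eigvecs.T                         # one eigenvector per row

def interclass_separation(T_disease, T_control):
    """Assumed Fisher-type separation for each transformed component."""
    mx, my = T_disease.mean(axis=0), T_control.mean(axis=0)
    scatter = (((T_disease - mx) ** 2).sum(axis=0)
               + ((T_control - my) ** 2).sum(axis=0))
    # Note: a component with zero scatter would divide by zero here.
    return (mx - my) ** 2 / scatter

def select_components(X_disease, X_control, k):
    """Return the k eigenvectors with the highest separation, plus sorted rho."""
    phi = dklt_matrix(X_disease, X_control)
    # Patients are stored row-wise, so V_tilde = Phi V becomes X @ Phi.T.
    rho = interclass_separation(X_disease @ phi.T, X_control @ phi.T)
    order = np.argsort(rho)[::-1]            # highest separation first
    return phi[order[:k]], rho[order]

# Synthetic normalized data: 20 patients per class, 4 risk factors.
rng = np.random.default_rng(0)
X_d = rng.normal(0.6, 0.1, (20, 4))
X_c = rng.normal(0.4, 0.1, (20, 4))
phi_k, rho_sorted = select_components(X_d, X_c, k=2)
```

Here `phi_k` plays the role of the truncated transformation matrix, and a test vector `v` would be reduced to `phi_k @ v`.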

Let $\hat{V}^n$ and $\hat{V}^m$ be the truncated feature vectors
formed by selecting the $k$ components with the highest
separations between the features in the transformed vectors
$\tilde{V}^n$ and $\tilde{V}^m$, respectively. The two hypotheses
can, therefore, be written as

$H_0$: $\hat{V}_t = \hat{V}^n$; the patient does not have the disease.

$H_1$: $\hat{V}_t = \hat{V}^m$; the patient has the disease.

The truncated test feature vector $\hat{V}_t$ is given by

\[ \hat{V}_t = \hat{\Phi} V_t, \]

where $\hat{\Phi}$ is the matrix formed by retaining the $k$
eigenvectors in the matrix $\Phi$ that yield the $k$ highest
separation features. If the a priori probabilities $P(H_0)$ and
$P(H_1)$ are assumed equal, the likelihood ratio decision rule can
be written as

\[ \Lambda = \frac{p(\hat{V}_t \mid H_1)}{p(\hat{V}_t \mid H_0)} \underset{H_0}{\overset{H_1}{\gtrless}} \frac{P(H_0)}{P(H_1)} = 1. \]

At this point, assumptions can be made for the conditional
densities $p(\hat{V}_t \mid H_0)$ and $p(\hat{V}_t \mid H_1)$. For
example, if it is assumed that the conditional densities are
Gaussian and if $M_n$ and $M_m$ are the mean vectors and $\Psi_n$
and $\Psi_m$ are the covariance matrices of
$p(\hat{V}_t \mid H_0)$ and $p(\hat{V}_t \mid H_1)$, respectively,
the likelihood ratio decision rule after taking logarithms and
rearranging terms can be written as

\[ \frac{1}{2} \left[ (\hat{V}_t - M_n)^T \Psi_n^{-1} (\hat{V}_t - M_n) - (\hat{V}_t - M_m)^T \Psi_m^{-1} (\hat{V}_t - M_m) \right] \underset{H_0}{\overset{H_1}{\gtrless}} \ln \left( \frac{|\Psi_m|}{|\Psi_n|} \right)^{1/2}, \]

where $|A|$ is the determinant of matrix $A$.
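
A minimal sketch of the Gaussian decision rule is given below, assuming equal priors and sample estimates of the means and covariances. The helper names `fit_gaussian` and `classify` and the synthetic class data are assumptions made for illustration only.

```python
import numpy as np

def fit_gaussian(T):
    """Mean vector and covariance matrix of row-wise samples T (J x k)."""
    return T.mean(axis=0), np.atleast_2d(np.cov(T, rowvar=False))

def classify(v, mean_n, cov_n, mean_m, cov_m):
    """Return 1 (H1, disease) or 0 (H0, no disease) for a truncated vector v."""
    dn, dm = v - mean_n, v - mean_m
    # Quadratic forms computed with solve() instead of an explicit inverse.
    q_n = dn @ np.linalg.solve(cov_n, dn)
    q_m = dm @ np.linalg.solve(cov_m, dm)
    lhs = 0.5 * (q_n - q_m)
    # ln(|cov_m| / |cov_n|)^(1/2), using log-determinants for stability.
    rhs = 0.5 * (np.linalg.slogdet(cov_m)[1] - np.linalg.slogdet(cov_n)[1])
    return int(lhs > rhs)

# Synthetic truncated feature vectors (k = 2), 30 per class.
rng = np.random.default_rng(1)
T_n = rng.normal(0.3, 0.05, (30, 2))   # no-disease class
T_m = rng.normal(0.7, 0.05, (30, 2))   # disease class
mean_n, cov_n = fit_gaussian(T_n)
mean_m, cov_m = fit_gaussian(T_m)

label_high = classify(np.array([0.7, 0.7]), mean_n, cov_n, mean_m, cov_m)
label_low = classify(np.array([0.3, 0.3]), mean_n, cov_n, mean_m, cov_m)
```

With these well-separated synthetic classes, a vector near the disease-class mean is assigned to $H_1$ and a vector near the no-disease mean to $H_0$. A tie (equality in the decision rule) is resolved in favor of $H_0$ here, which is an arbitrary choice.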


4.  PERFORMANCE  EVALUATION 
The question of how best to partition a data set into a design set
for classifier development and a test set for testing the
classifier has received considerable attention [2-4]. The
cross-validation method, which randomly partitions the data set
into two mutually exclusive and equi-sized sets to generate the
design set and the test set, can be employed to evaluate the
performance. This method is effective only when the design set is
large enough to robustly estimate the parameters of the
classifier. The classification accuracy is estimated as the
fraction of vectors in the test set that are correctly classified.
The random partitioning can be repeated $H$ times (trials), and
the classification accuracy is then estimated by averaging the
resulting $H$ classification accuracies. That is, the
classification accuracy is given by

\[ \alpha_H = \left[ \frac{1}{H} \sum_{h=1}^{H} \alpha_h \right] \times 100\%, \]

where

\[ \alpha_h = \frac{\text{number of correctly classified vectors}}{\text{number of vectors in the test set}} \]

in trial $h$, $h = 1, 2, \ldots, H$.
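
The repeated random partitioning can be sketched as below. The function `repeated_holdout_accuracy` is a hypothetical helper, and the nearest-class-mean classifier is only a simple stand-in so that the example is self-contained; it is not the Gaussian classifier of Section 3. The split is a plain random half-split of all patients, since the text does not say whether the partition is balanced by class.

```python
import numpy as np

def repeated_holdout_accuracy(X, y, fit, predict, H=50, seed=0):
    """Average test accuracy (in percent) over H random half-splits."""
    rng = np.random.default_rng(seed)
    n = len(y)
    accuracies = []
    for _ in range(H):
        perm = rng.permutation(n)
        design, test = perm[: n // 2], perm[n // 2:]
        model = fit(X[design], y[design])
        correct = sum(predict(model, x) == t for x, t in zip(X[test], y[test]))
        accuracies.append(correct / len(test))   # alpha_h for this trial
    return 100.0 * float(np.mean(accuracies))    # alpha_H

# Stand-in classifier (assumption): assign to the nearest class mean.
def fit_nearest_mean(X, y):
    return {label: X[y == label].mean(axis=0) for label in np.unique(y)}

def predict_nearest_mean(model, x):
    return min(model, key=lambda label: np.linalg.norm(x - model[label]))

# Synthetic, well-separated data: 20 vectors per class, 3 features.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 3)), rng.normal(5.0, 0.1, (20, 3))])
y = np.array([0] * 20 + [1] * 20)
acc = repeated_holdout_accuracy(X, y, fit_nearest_mean, predict_nearest_mean, H=10)
```

Any pair of `fit` and `predict` callables can be passed in, so the same loop could wrap the normalization, DKLT, feature selection, and Gaussian classifier steps, with every parameter estimated from the design set of each trial.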


5.  GOUT 

Gout is a form of arthritis usually caused by increased levels of
uric acid circulating in the blood and being deposited as
needle-like crystals within joints and tissues. These deposits
lead to episodes of inflammatory arthritis, which result in pain,
swelling, redness, and damage to the joints. Elevated uric acid
levels alone, however, are not sufficient to diagnose gout because
only 10% to 20% of individuals with high levels of uric acid
develop gout [5]. Additionally, the uric acid levels in the blood
may be transiently normal or low during a gout attack [6]. Gout
may be difficult to diagnose at times because its symptoms may
mimic those of other rheumatic diseases [7]. Conversely, other
kinds of arthritis can mimic a gout attack. Because treatment of
gout is specific, the correct diagnosis is essential.

Risk factors for gout have been studied extensively for years, and
the correlations between the risk factors and gout are summarized
in reference [8]. The following 14 variables from the risk factor
set were included in this study: serum uric acid, gender, age (at
diagnosis of gout), the presence or absence of diabetes, the
presence or absence of hypertension, weight, height, body surface
area, history of kidney stones, the use or non-use of thiazide
diuretics, serum cholesterol, triglycerides, creatinine, and blood
urea nitrogen.

6.  GOUT  CLASSIFICATION  EXPERIMENTS 
The computer records of a multi-specialty group practice were
searched for patients with a diagnosis code for gout who had an
office visit during a nine-month period. Of 91 charts available
for review, 48 patients were identified who had information
available for all parameters under investigation. The diagnosis of
gout was considered confirmed by a rheumatologist if either uric
acid crystals were identified in synovial fluid, or a classic
attack of podagra (acute arthritis of the great toe) was
documented, or if there were typical recurrent attacks of
monoarticular arthritis in the absence of another arthropathy
which could account for the symptoms.

Forty-eight patients without gout were matched for gender and for
age to the gout patients and were identified from the clinic
laboratory records as having had a multi-channel chemistry profile
performed during a previous three-month period. In both cohorts,
patients were excluded if they had psoriasis or a
lymphoproliferative disorder, or were taking uricosuric therapy or
allopurinol at the time the laboratory data were available,
because all of these could alter their serum uric acid values. The
control group excluded those patients taking aspirin (which lowers
serum uric acid), but patients taking other non-steroidal
anti-inflammatory drugs were included.

Classifiers to predict the occurrence of gout were developed
exactly as outlined in Sections 2 and 3. The cross-validation
method described in Section 4 was used to evaluate the
performance. The estimated classification accuracies of the
Gaussian classifiers, expressed as percentages, are shown in
Figure 2. The results are presented for $H = 50$ and for
$k = 1, 2, \ldots, 14$, where $k$ is the number of rank ordered
features in the transformed feature vectors. For each of the
$H = 50$ random partitions into a training and test set, the DKLT
transformation matrix was computed from the training set, and the
test results are shown by the solid lines in Figure 2. The results
show that an average classification accuracy of 75.7% can be
obtained by selecting the three components with the highest
inter-class separation in the transformed feature vectors. The
dotted lines in Figure 2 show the test results obtained by
selecting, through trial and error, a DKLT transformation matrix
that gave good results. That is, the DKLT transformation matrix
was computed just once and tested $H = 50$ times with randomly
selected test sets. It is seen that by selecting the five
components with the highest inter-class separation, a
classification accuracy of 87.4% can be achieved.

7.  CONCLUSIONS 

This paper focused on developing a pattern recognition methodology
to detect the presence or absence of a disease from a set of risk
factors correlated with the disease. The generalized
classification methodology included fusion to combine risk factors
into a single feature vector, normalization to overcome the
problems associated with fusing features which have different
formats and ranges, DKLT-based transformation to facilitate
parametric classifier development, feature selection, and Gaussian
likelihood ratio classifier design. The methodology was applied to
detect the presence or absence of gout from a set of 14 risk
factors thought to be correlated with gout. Cross-validation
evaluations on patients clinically diagnosed as having or not
having gout showed that, on average, a classification accuracy of
75.7% can be obtained, which is quite encouraging. What is even
more promising is that classification accuracies of over 87% can
be achieved through the careful selection of the DKLT
transformation matrix, which in turn involves selecting training
sets that are good representatives of the gout and non-gout
classes. It could, therefore, be concluded that the gout and
non-gout classes could be separated even more accurately if larger
and more representative training sets were available for the two
classes. In summary, the generalized disease prediction
methodology developed in this paper can assist a physician in
diagnosing a multifactorial disease.

Fig. 1. Block diagram of the risk factor fusion approach to classify gout.


Fig.  2.  Classification  results. 